18 research outputs found

    SNP-based pathway enrichment analysis for genome-wide association studies

    Background: Recently we have witnessed a surge of interest in using genome-wide association studies (GWAS) to discover the genetic basis of complex diseases. Many genetic variations, mostly in the form of single nucleotide polymorphisms (SNPs), have been identified in a wide spectrum of diseases, including diabetes, cancer, and psychiatric diseases. A common theme arising from these studies is that the genetic variations discovered by GWAS can explain only a small fraction of the genetic risk associated with complex diseases. New strategies and statistical approaches are needed to address this gap. One such approach is pathway analysis, which considers the genetic variations underlying a biological pathway jointly, rather than separately as in traditional GWAS. A critical challenge in pathway analysis is how to combine evidence of association over multiple SNPs within a gene and multiple genes within a pathway. Most current methods choose the most significant SNP from each gene as a representative, ignoring the joint action of multiple SNPs within a gene. This approach leads to preferential identification of genes with a greater number of SNPs.
    Results: We describe a SNP-based pathway enrichment method for GWAS. The method consists of two main steps: 1) for a given pathway, using an adaptive truncated product statistic to identify all representative (potentially more than one) SNPs of each gene, calculating the average number of representative SNPs per gene, then re-selecting the representative SNPs of genes in the pathway based on this number; and 2) ranking all selected SNPs by the significance of their statistical association with a trait of interest, and testing whether the set of SNPs from a particular pathway is significantly enriched with high ranks using a weighted Kolmogorov-Smirnov test. We applied our method to two large, genetically distinct GWAS data sets of schizophrenia, one from a European-American (EA) sample and the other from an African-American (AA) sample. In the EA data set, we found 22 pathways with a nominal P-value less than or equal to 0.001 and a corresponding false discovery rate (FDR) less than 5%. In the AA data set, we found 11 pathways at the same nominal P-value and FDR thresholds. Interestingly, 8 of these pathways overlap with those found in the EA sample. We have implemented our method in a Java software package called SNP Set Enrichment Analysis (SSEA), which contains a user-friendly interface and is freely available at http://cbcl.ics.uci.edu/SSEA.
    Conclusions: The SNP-based pathway enrichment method described here offers a new alternative approach for analysing GWAS data. By applying it to schizophrenia GWAS, we show that our method is able to identify statistically significant pathways and, importantly, pathways that can be replicated in large, genetically distinct samples.
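
The enrichment step in the abstract above (ranking SNPs by association and testing a pathway's SNP set with a weighted Kolmogorov-Smirnov statistic) can be illustrated with a running-sum score. This is a minimal sketch, not the SSEA implementation: the function name `weighted_ks_score`, the weighting exponent `p`, and the use of a generic association score (for example -log10 of the P-value) are assumptions made for illustration.

```python
def weighted_ks_score(ranked_ids, scores, pathway, p=1.0):
    """Running-sum enrichment score in the style of a weighted KS statistic.

    ranked_ids: SNP ids sorted from strongest to weakest association.
    scores:     dict mapping SNP id -> association score (larger = stronger).
    pathway:    set of SNP ids selected as representatives for one pathway.
    p:          hypothetical weighting exponent (p=0 gives the unweighted form).
    """
    hits = [s for s in ranked_ids if s in pathway]
    n_miss = len(ranked_ids) - len(hits)
    if not hits or n_miss == 0:
        raise ValueError("need at least one SNP inside and one outside the pathway")

    total_weight = sum(abs(scores[s]) ** p for s in hits)
    if total_weight == 0:
        raise ValueError("pathway SNPs carry zero total weight")

    running = 0.0
    best = 0.0  # signed maximum deviation of the running sum from zero
    for s in ranked_ids:
        if s in pathway:
            # Step up in proportion to the SNP's association strength.
            running += abs(scores[s]) ** p / total_weight
        else:
            # Step down by a constant amount for SNPs outside the pathway.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 means the pathway's SNPs cluster at the top of the ranking. In practice significance would be assessed by recomputing the score under permutations, which this sketch leaves out.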

    Hobbes: optimized gram-based methods for efficient read alignment

    Recent advances in sequencing technology have enabled the rapid generation of billions of bases, in the form of short reads, at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge because of the sheer number of reads, and because many reads map to the reference sequence approximately but not exactly. We introduce Hobbes, a new gram-based program for aligning short reads that supports Hamming and edit distance. Hobbes implements two novel techniques that yield substantial performance improvements: an optimized gram-selection procedure for reads, and a cache-efficient filter for pruning candidate mappings. We systematically tested the performance of Hobbes on both real and simulated data with read lengths varying from 35 to 100 bp, and compared its performance with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast and RazerS. Hobbes is faster than all other read-mapping programs we tested while maintaining high mapping quality. When asked to find all mapping locations of a read in the human genome within a given Hamming or edit distance, Hobbes is about five times faster than Bowtie and about 2–10 times faster than BWA, respectively, depending on read length and error rate. Hobbes supports the SAM output format and is publicly available at http://hobbes.ics.uci.edu
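
The general idea behind gram-based read mapping can be shown with a pigeonhole filter for Hamming distance: if a read matches a reference location with at most k mismatches, at least one of k+1 non-overlapping grams from the read must match exactly. The sketch below is a generic illustration under that assumption and does not reproduce Hobbes's optimized gram selection or its cache-efficient filter; the names `build_gram_index` and `map_read` are hypothetical.

```python
from collections import defaultdict


def build_gram_index(reference, q):
    """Inverted index: each q-gram -> list of start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, q, k):
    """Return all reference positions where `read` matches within k mismatches."""
    n_pieces = k + 1
    if len(read) < n_pieces * q:
        raise ValueError("read too short for k+1 non-overlapping q-grams")

    # Filtering: collect candidate start positions from exact gram matches.
    candidates = set()
    for j in range(n_pieces):
        offset = j * q
        gram = read[offset:offset + q]
        for pos in index.get(gram, ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    # Verification: keep candidates whose full Hamming distance is within k.
    return sorted(
        s for s in candidates
        if hamming(read, reference[s:s + len(read)]) <= k
    )
```

Only candidates that share an exact gram with the read are verified, so most of the reference is never compared against the read.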

    Computational Approaches Toward a Polyadenylation Code

    Messenger RNA 3' polyadenylation (poly(A)) is an essential post-transcriptional processing step for most eukaryotic genes, significantly impacting many aspects of mRNA metabolism. The majority of eukaryotic genes exhibit alternative poly(A) (APA), through which the same gene can have multiple alternative 3' ends because cleavage and poly(A) occur at distinct sites. APA results in RNA transcripts with different 3'UTRs, which can influence transcript transport, localization, stability, and translation, or lead to different protein products. Many human diseases, including cancer, have been associated with abnormal poly(A) regulation, highlighting the importance of this process. However, the rules on how poly(A) sites are selected and regulated (the so-called poly(A) code) are not well understood.
    Recent advances in high-throughput technologies have provided a great opportunity to elucidate the rules underlying APA. High-throughput sequencing (HTS) experiments yield a wealth of data regarding APA. Consequently, there is a need to develop computational techniques to mine these data. In this thesis, we present four major contributions furthering our understanding of the poly(A) code. The algorithms and computational methods we developed have all shown improved predictive and analytical capabilities over competing methods. They are as follows:
    1) HTS reads need to be efficiently mapped back to a reference genome for further downstream analysis. To address this need, we developed a fast and accurate read-mapping package for identifying all mapping locations for each read, called "Hobbes". Hobbes outperforms most state-of-the-art "all-mapping" programs, including mrsFast and Razers2.
    2) We developed a bioinformatics pipeline for identifying and profiling genes with significant APA switches across different biological or clinical conditions. The pipeline includes calling poly(A) sites, filtering artificial poly(A) sites, clustering heterogeneous poly(A) sites, and identifying and profiling genes with significant APA switches. This pipeline has already provided significant insights into many core polyadenylation factors.
    3) The poly(A) code can be partially deciphered from the genome-wide modeling of tissue-specific APA. Consequently, we extended the existing Shannon entropy measure to assess the tissue specificity of each poly(A) site, and applied an outlier detection method to identify tissue-specific patterns. With the new mRNA features we explored, our ensemble predictive model successfully discriminated tissue-specific poly(A) sites from constitutive poly(A) sites, with a test accuracy of 84.5% (auROC 0.92), surpassing the previous model by more than 10%. Through an in-depth analysis of the most important features, we proposed a mechanism that controls the selection and regulation of tissue-specific APA.
    4) Aberrant mRNA 3' polyadenylation has been implicated in a wide variety of complex diseases. We developed a novel statistical method for identifying disease-related pathways from genome-wide association studies (GWAS). We proposed to optimally select a representative SNP (single nucleotide polymorphism) set for each gene using an adaptive truncated product statistic, and conducted enrichment analysis via the weighted Kolmogorov-Smirnov test to identify enriched pathways. By applying it to schizophrenia GWAS SNPs, we showed that our method identifies pathways highly associated with the disease. Moreover, the results are reproducible across large, genetically distinct samples. This method can be used for detecting pathways involved in diseases caused by APA, such as cancer.
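
The Shannon entropy measure mentioned in contribution 3 can be illustrated in its basic form: normalize a poly(A) site's usage across tissues into a distribution and compute its entropy, so that low entropy indicates usage concentrated in few tissues. The thesis abstract does not spell out its extension of this measure, so the sketch below shows only the standard entropy; the function name `tissue_entropy` and the input format are assumptions.

```python
import math


def tissue_entropy(usage):
    """Shannon entropy (in bits) of a site's usage profile across tissues.

    usage: non-negative usage values (e.g. read counts), one per tissue.
    Returns 0.0 when usage is confined to one tissue and log2(n) when it
    is spread evenly over n tissues.
    """
    if any(u < 0 for u in usage):
        raise ValueError("usage values must be non-negative")
    total = sum(usage)
    if total <= 0:
        raise ValueError("site has no usage in any tissue")
    probs = [u / total for u in usage]
    # Zero-probability tissues contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

For example, a site used equally in four tissues has an entropy of 2 bits, while a site used in only one tissue has an entropy of 0.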

    Stochastic Subgradient Methods

    Stochastic subgradient methods play an important role in machine learning. In this project, we introduce the concepts of subgradient methods and stochastic subgradient methods, and discuss their convergence conditions as well as their strengths and weaknesses relative to competing methods. Throughout the report, we demonstrate the application of (stochastic) subgradient methods to machine learning with a running example of training support vector machines (SVMs).
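
As an illustration of the running example named in this abstract, the sketch below trains a linear SVM without a bias term by stochastic subgradient descent on the regularized hinge loss. It is not taken from the report: the 1/(lam*t) step size follows the well-known Pegasos scheme, and the function names and default parameters are assumptions.

```python
import random


def train_svm_sgd(X, y, lam=0.01, epochs=200, seed=0):
    """Stochastic subgradient descent for a linear SVM (no bias term).

    Minimizes lam/2 * ||w||^2 + mean(max(0, 1 - y_i * <w, x_i>)).
    X: list of feature vectors, y: list of labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Subgradient of the regularizer: shrink w toward zero.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Subgradient of the hinge loss is -y_i * x_i when margin < 1.
            if margin < 1:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, x):
    """Classify x by the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

The hinge loss is not differentiable at margin 1, which is why a subgradient rather than a gradient is used. Any valid subgradient at that point works, and this sketch takes zero.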


    Poly(A) code analyses reveal key determinants for tissue-specific mRNA alternative polyadenylation.

    mRNA alternative polyadenylation (APA) is a critical mechanism for post-transcriptional gene regulation and is often regulated in a tissue- and/or developmental stage-specific manner. An ultimate goal for the APA field has been to computationally predict APA profiles under different physiological or pathological conditions. As a first step toward this goal, we have assembled a poly(A) code for predicting tissue-specific poly(A) sites (PASs). Based on a compendium of over 600 features that have known or potential roles in PAS selection, we have generated and refined a machine-learning algorithm using multiple high-throughput sequencing-based data sets of tissue-specific and constitutive PASs. This code can predict tissue-specific PASs with >85% accuracy. Importantly, by analyzing the prediction performance based on different RNA features, we found that PAS context, including the distance between alternative PASs and the relative position of a PAS within the gene, is a key feature for determining the susceptibility of a PAS to tissue-specific regulation. Our poly(A) code provides a useful tool not only for predicting tissue-specific APA regulation, but also for studying its underlying molecular mechanisms.